November 2002 Proceedings of the 2002 ACM/IEEE conference on Supercomputing

Additional Information: full citation, abstract, references, citings, index terms

Contemporary microprocessors provide a rich set of integrated performance counters that 
allow application developers and system architects alike the opportunity to gather important 
information about workload behaviors. Current techniques for analyzing data produced from 
these counters use raw counts, ratios, and visualization techniques to help users make
decisions about their application performance. While these techniques are appropriate for 
analyzing data from one process, they do not scale easi ... 



Continuous profiling: where have all the cycles gone?

Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R. Henzinger, 
Additional Information: full citation, abstract, references, citings, index terms



Full text available: pdf(259.3S KB)



This article describes the Digital Continuous Profiling Infrastructure, a sampling-based 
profiling system designed to run continuously on production systems. The system supports 
multiprocessors, works on unmodified executables, and collects profiles for entire systems, 
including user programs, shared libraries, and the operating system kernel. Samples are 
collected at a high rate (over 5200 samples/sec. per 333MHz processor), yet with low 
overhead (1-3% slowdown for most workloads). A ... 

Keywords: performance understanding, performance-monitoring hardware, profiling, 
program analysis 
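
The sampling approach in this abstract can be illustrated with a minimal sketch, which is not the system's actual implementation: periodically sampled program-counter values are attributed to the procedure whose address range contains them, giving a histogram of where cycles are spent. The symbol table, the addresses, and the name `attribute_samples` are hypothetical.

```python
import bisect
from collections import Counter


def attribute_samples(pc_samples, symbols):
    """Count samples per procedure.

    symbols is a list of (start_address, name) pairs sorted by address;
    a sample belongs to the last symbol starting at or before it.
    """
    starts = [start for start, _ in symbols]
    histogram = Counter()
    for pc in pc_samples:
        index = bisect.bisect_right(starts, pc) - 1
        name = symbols[index][1] if index >= 0 else "<unknown>"
        histogram[name] += 1
    return histogram


# Hypothetical symbol table covering a user program, a library, and the kernel.
symbols = [(0x1000, "main"), (0x1200, "memcpy"), (0x2000, "kernel_idle")]
samples = [0x1004, 0x1210, 0x1220, 0x0800, 0x2004]
histogram = attribute_samples(samples, symbols)
```

Because attribution needs only addresses and a symbol table, a scheme like this can work on unmodified executables, which matches the property the abstract highlights.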

Non-blocking loads are a very effective technique for tolerating the cache-miss latency on
data cache references. In this paper, we describe several methods for implementing non- 
blocking loads. A range of resulting hardware complexity/performance tradeoffs are 
investigated using an object-code translation and instrumentation system. We have 
investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a 
simple hit-under-miss implementation achieves almost all of the availabl ... 

memory

Daniel J. Scales, Kourosh Gharachorloo, Chandramohan A. Thekkath

October 1996 Proceedings of the seventh international conference on Architectural
support for programming languages and operating systems, Volume 30, 31 Issue 5, 9
Full text available: pdf(1.49 MB) Additional Information: full citation, abstract, references, citings, index terms

This paper describes Shasta, a system that supports a shared address space in software on 
clusters of computers with physically distributed memory. A unique aspect of Shasta 
compared to most other software distributed shared memory systems is that shared data 
can be kept coherent at a fine granularity. In addition, the system allows the coherence 
granularity to vary across different shared data structures in a single application. Shasta 
implements the shared address space by transparently rewrit ... 

5 Continuous profiling: where have all the cycles gone?

Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R. Henzinger, 
Shun-Tak A. Leung, Richard L. Sites, Mark T. Vandevoorde, Carl A. Waldspurger, William E.
Weihl 

October 1997 ACM SIGOPS Operating Systems Review, Proceedings of the sixteenth
ACM symposium on Operating systems principles, Volume 31 Issue 5
Full text available: pdf(2.29 MB) Additional Information: full citation, references, citings, index terms

Ziya Aral, Ilya Gertner

January 1988 ACM SIGPLAN Notices, Proceedings of the ACM/SIGPLAN conference on
Parallel programming: experience with applications, languages and
systems, Volume 23 Issue 9
Full text available: pdf(1.05 MB) Additional Information: full citation, abstract, references, citings, index terms

Debugging the performance of parallel applications is crucial for fully utilizing the potential 
of multiprocessor hardware. This paper describes profiling tools in Parasight, a 
programming environment that is geared towards non-intrusive performance analysis and 
high-level debugging. In Parasight, profilers execute as observer programs which run 
concurrently with the target program and monitor its behavior. This design grew out of our 
experience in debugging and monitoring the performance o ... 

Exploiting

Glenn Ammons, Thomas Ball, James R. Larus 

May 1997 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1997 conference
on Programming language design and implementation, Volume 32 Issue 5
Full text available: pdf(1.67 MB) Additional Information: full citation, abstract, references, citings, index terms

A program profile attributes run-time costs to portions of a program's execution. Most 
profiling systems suffer from two major deficiencies: first, they only apportion simple 
metrics, such as execution frequency or elapsed time to static, syntactic units, such as
procedures or statements; second, they aggressively reduce the volume of information
collected and reported, although aggregation can hide striking differences in program
behavior. This paper addresses both concerns by exploiting the har ...

8 Computation: Understanding the efficiency of GPU algorithms for matrix-matrix
multiplication

K. Fatahalian, J. Sugerman, P. Hanrahan 

August 2004 Proceedings of the ACM SIGGRAPH/ EUROGRAPHICS conference on 
Graphics hardware 

Full text available: pdf(145.04 KB) Additional Information: full citation, abstract, references, index terms

Utilizing graphics hardware for general purpose numerical computations has become a topic 
of considerable interest. The implementation of streaming algorithms, typified by highly 
parallel computations with little reuse of input data, has been widely explored on GPUs. We 
relax the streaming model's constraint on input reuse and perform an in-depth analysis of 
dense matrix-matrix multiplication, which reuses each element of input matrices O(n) 
times. Its regular data access pattern and highly para ... 
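
To make the O(n) reuse claim concrete, the following sketch counts how often a naive n-by-n multiplication reads each element of the first input matrix. It is a plain-Python illustration under simple assumptions (square matrices, triple-loop algorithm), not the paper's GPU code; the name `matmul_with_read_counts` is hypothetical.

```python
def matmul_with_read_counts(a, b):
    """Multiply square matrices a and b, recording reads of each a[i][k]."""
    n = len(a)
    product = [[0] * n for _ in range(n)]
    reads_of_a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                product[i][j] += a[i][k] * b[k][j]
                reads_of_a[i][k] += 1  # a[i][k] is read once per output column j
    return product, reads_of_a


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
product, reads_of_a = matmul_with_read_counts(matrix, identity)
```

Every entry of `reads_of_a` equals n, which is the input reuse that a pure streaming model does not exploit.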

David W. Wall

May 1992 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 14 Issue 3
Full text available: pdf(2.86 MB) Additional Information: full citation, abstract, references, citings, index terms, review

We have built a system in which the compiler back end and the linker work together to 
present an abstract machine at a considerably higher level than the actual machine. The 
intermediate language translated by the back end is the target language of all high-level 
compilers and is also the only assembly language generally available. This lets us do 
intermodule register allocation, which would be harder if some of the code in the program 
had come from a traditional assembler, out of sight of ... 

Keywords: RISC, graph coloring, intermediate language, interprocedural, optimization, 
pipeline scheduling, profiling, register allocation, register windows 

ATOM (Analysis Tools with OM) is a single framework for building a wide range of
customized program analysis tools. It provides the common infrastructure present in all 
code-instrumenting tools; this is the difficult and time-consuming part. The user simply 
defines the tool-specific details in instrumentation and analysis routines. Building a basic- 
block counting tool like Pixie with ATOM requires only a page of code. ATOM, using OM link- 
time technology, organizes the final execu ... 

Pre-execution techniques have received much attention as an effective way of prefetching
cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution
techniques based on hardware, compiler, or both have been proposed and studied
extensively by researchers. They report promising results on simulators that model a
Simultaneous Multithreading (SMT) processor. In this paper, we apply the helper threading
idea on a real multithreaded machine, i.e., Intel Pentium 4 processor with Hyp ...

12 Memory-wall: Profile-guided post-link stride prefetching
Chi-Keung Luk, Robert Muth, Harish Patil, Richard Weiss, P. Geoffrey Lowney, Robert Cohn 
June 2002 Proceedings of the 16th international conference on Supercomputing 



Data prefetching is an effective approach to addressing the memory latency problem. While 
a few processors have implemented hardware-based data prefetching, the majority of 
modern processors support data-prefetch instructions and rely on compilers to 
automatically insert prefetches. However, most prefetching schemes in commercial 
compilers suffer from two limitations: (1) the source code must be available before 
prefetching can be applied, and (2) these prefetching schemes target only loops with ... 

Keywords: address strides, data prefetching, memory latency, post-link optimizations, 
profiling 
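
As a rough illustration of what an address-stride profile might record (an assumption, since the abstract is truncated before it describes the method), the sketch below finds the most common difference between successive addresses issued by one load and uses it to pick a prefetch target. The names `dominant_stride` and `prefetch_address`, the trace, and the prefetch distance are hypothetical.

```python
from collections import Counter


def dominant_stride(addresses):
    """Return (stride, fraction) for the most frequent successive address delta."""
    deltas = [later - earlier for earlier, later in zip(addresses, addresses[1:])]
    if not deltas:
        return None, 0.0
    stride, hits = Counter(deltas).most_common(1)[0]
    return stride, hits / len(deltas)


def prefetch_address(current, stride, distance):
    """Address to prefetch `distance` iterations ahead of the current access."""
    return current + stride * distance


# Hypothetical address trace for a single load instruction.
trace = [0x1000, 0x1040, 0x1080, 0x10C0, 0x2000, 0x2040]
stride, fraction = dominant_stride(trace)
target = prefetch_address(trace[-1], stride, distance=4)
```

A tool working from such a profile would presumably insert a prefetch only when the dominant stride accounts for a large enough fraction of accesses; the threshold is a design choice not stated in the abstract.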

A path profile determines how many times each acyclic path in a routine executes. This
type of profiling subsumes the more common basic block and edge profiling, which only 
approximate path frequencies. Path profiles have many potential uses in program 
performance tuning, profile-directed compilation, and software test coverage. This paper 
describes a new algorithm for path profiling. This simple, fast algorithm selects and places 
profile instrumentation to minimize run-time overhead. Instrument ... 
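
One way to count acyclic paths with a single counter update per path is to assign integer increments to control-flow edges so that each entry-to-exit path sums to a unique ID. The sketch below follows that general idea for a small acyclic graph. It is an assumption about the approach rather than the paper's algorithm (the abstract is truncated), and it omits the instrumentation placement the abstract mentions. The graph and the names `number_paths` and `profile_paths` are hypothetical.

```python
from collections import Counter


def number_paths(cfg, entry, exit_node):
    """Return (path_count, edge_increments) for an acyclic control-flow graph."""
    num_paths = {}
    edge_increment = {}

    def visit(node):
        if node not in num_paths:
            if node == exit_node:
                num_paths[node] = 1
            else:
                total = 0
                for successor in cfg[node]:
                    # Paths through earlier successors take the lower IDs.
                    edge_increment[(node, successor)] = total
                    total += visit(successor)
                num_paths[node] = total
        return num_paths[node]

    return visit(entry), edge_increment


def profile_paths(traces, edge_increment):
    """Count executions of each path, identified by the sum of its edge increments."""
    counts = Counter()
    for trace in traces:
        path_id = sum(edge_increment[edge] for edge in zip(trace, trace[1:]))
        counts[path_id] += 1
    return counts


# Hypothetical routine with two consecutive branches, giving four acyclic paths.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
path_count, edge_increment = number_paths(cfg, "A", "G")
counts = profile_paths([list("ABDEG"), list("ACDFG"), list("ACDFG")], edge_increment)
```

Edge profiling on the same three runs would show that edges A-C and D-F each executed twice but could not confirm that they executed together, which is the extra information a path profile provides.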

14 A general framework for prefetch scheduling in linked data structures and its 
application to multi-chain prefetching 

Pointer-chasing applications tend to traverse composite data structures consisting of
multiple independent pointer chains. While the traversal of any single pointer chain leads to 
the serialization of memory operations, the traversal of independent pointer chains provides 
a source of memory parallelism. This article investigates exploiting such interchain memory 
parallelism for the purpose of memory latency tolerance, using a technique called multi- 
chain prefetching. Previous work ... 

Keywords: Data prefetching, memory parallelism, pointer-chasing code 

Ispike is a post-link optimizer developed for the Intel® Itanium Processor Family (IPF)
processors. The IPF architecture poses both opportunities and challenges to post-link
optimizations. IPF offers a rich set of performance counters to collect detailed profile
information at a low cost, which is essential to post-link optimization being practical. At the
same time, the predication and bundling features on IPF make post-link code transformation
more challenging than on other architectures. In Ispike, we have ...

Preventing execution of unauthorized software on a given computer plays a pivotal role in
system security. The key problem is that although a program at the beginning of its 
execution can be verified as authentic, its execution flow can be redirected to externally 
injected malicious code using, for example, a buffer overflow exploit. We introduce a novel, 
simplified, hardware-assisted intrusion prevention platform. Our platform introduces 
overlapping of program execution and MAC verification. It ... 

To cope with the increasing difference between processor and main memory speeds,
modern computer systems use deep memory hierarchies. In the presence of such 
hierarchies, the performance attained by an application is largely determined by its memory 
reference behavior: if most references hit in the cache, the performance is significantly
higher than if most references have to go to main memory. Frequently, it is possible for the 
programmer to restructure the data or code to achieve bet ... 

This paper describes Embra, a simulator for the processors, caches, and memory systems
of uniprocessors and cache-coherent multiprocessors. When running as part of the SimOS 
simulation environment, Embra models the processors of a MIPS R3000/R4000 machine 
faithfully enough to run a commercial operating system and arbitrary user applications. To 
achieve high simulation speed, Embra uses dynamic binary translation to generate code 
sequences which simulate the workload. It is the first machine simu ... 